library(atus)
library(dplyr)
library(rpart)
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.2
library(ggplot2)
library(RColorBrewer)
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
library(splines)
Our group is using the atus database to investigate whether various demographic variables correlate to family income. Specifically, we will be looking at region, state, educational attainment, home ownership (rent vs. own), household size, race, and time usage. After visualizing the data, we will test the correlations between each of these demographic variables and income using linear regression or other related statistical analyses. Using the results from these statistical analyses, we will determine relationships between the demographic variables and how they collectively impact family income and how income potentially affects time usage. This information could be useful to better inform policy regarding issues relevant to specific demographics.
library(tidyr)
library(dplyr)
library(atus)
data("atuscps")
tibble1 <- as_tibble(atuscps) %>%
na.omit()
library(ggplot2)
library(RColorBrewer)
tibble1 <- as_tibble(atuscps) %>%
na.omit()
plotcps <- ggplot(data=tibble1) +
geom_bar(mapping=aes(x=famincome, fill=home_type)) +
xlab("Family Income") +
ylab("Count") +
scale_color_brewer("Set1")
plotcps + theme(axis.text.x = element_text(angle = 27, size = 6))
library(ggplot2)
library(RColorBrewer)
plotdps <- ggplot(data=tibble1) +
geom_bar(mapping=aes(x=home_type, fill=famincome)) +
xlab("Home Type") +
ylab("Count") +
scale_color_brewer("Set1")
plotdps + theme(axis.text.x = element_text(angle = 27, size = 6))
Color-coded simple stacked bar charts, using ggplot2, were chosen to visualize the data because this method is useful in comparing the total amounts across each segmented group. Additionally, stacked bar charts have the advantages of being able to easily view the data and identify potential outliers.
When considering the relationship between the type of home (own, rent, occupy without rent) and family income, the two stacked bar charts visualize the following. No matter what family income range one falls under, there seems to be worth and demand in owning a house. This means even within a humble, lower family income range, there are still home-owners present. This offers the potentially interesting finding that people of lower socioeconomic classes are still trying to chase “the American Dream” or stability, despite renting being a more feasible option for their financial situation. It must be said of course, that this idea is not generalizable to that socioeconomic class as a whole because as both stacked bar charts suggest visually, for lower socioeconomic classes, the renter population still remains more popular than the homeowner population. On the contrary, we see that in the middle to higher income brackets, there are certainly a much larger count of homeowners compared to its opposing category, renters.
library(atus)
chart = table(atuscps$famincome, atuscps$home_type)
chart
##
## own rent occupy without rent
## < 5000 1291 2718 249
## 5000-7499 1171 2196 102
## 7500-9999 1614 2666 132
## 10000-12499 2588 3097 156
## 12500-14999 2614 2492 108
## 15000-19999 4451 3766 154
## 20000-24999 5771 4086 157
## 25000-29999 6269 3783 170
## 30000-34999 6850 3484 117
## 35000-39999 6547 2747 134
## 40000-49999 11020 3734 156
## 50000-59999 11462 2755 106
## 60000-74999 14442 2621 87
## 75000-99999 20041 2360 86
## 100000-149999 14486 1212 52
## 150000+ 10483 698 20
chisq.test(chart)
##
## Pearson's Chi-squared test
##
## data: chart
## X-squared = 26953, df = 30, p-value < 2.2e-16
As the p-value (<2.2e-16) is less than the .05 significance level, and the chi-squared value is notably large (26953), we reject the null hypothesis and conclude that the two variables, family income and home type (own, rent, occupy without rent), are in fact dependent and have a significant relationship.
library(tidyr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(atus)
atus2 <- cbind(atuscps, atusresp)
atusresp$hh_size <- as.character(as.numeric(atusresp$hh_size))
atus3 <- atus2 %>%
select(famincome, hh_size)
atus4 <- as_tibble(atus3)
chart1 <- ggplot(data=atus4) +
geom_bar(mapping=aes(x=famincome, fill=hh_size)) +
xlab("Family Income") +
ylab("Count") +
scale_color_brewer("Set1")
chart1 + theme(axis.text.x = element_text(angle = 27, size = 6))
library(tidyr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(atus)
atus2 <- cbind(atuscps, atusresp)
atusresp$hh_size <- as.character(as.numeric(atusresp$hh_size))
atus3 <- atus2 %>%
select(famincome, hh_size)
atus4 <- as_tibble(atus3)
chart14 <- ggplot(data=atus4) +
geom_bar(mapping=aes(x=hh_size, fill=famincome)) +
xlab("Household Size") +
ylab("Count") +
scale_color_brewer("Set1")
chart14 + theme(axis.text.x = element_text(angle = 27, size = 6))
The stacked bar charts depicting the relationship between household size (1-16) and family income suggest that there is a varied and potentially similar distribution of household size across each family income range. Further, in taking a closer look at the distribution of family income across household sizes, it is clear that there is a varied distribution, with no income group particularly prominent in each household size. Hence, we must use statistical analyses to see if household size and family income have a significant relationship.
library(atus)
chart2 = table(atuscps$famincome, atusresp$hh_size)
chart2
##
## 1 10 11 12 13 14 15 16 2 3 4 5
## < 5000 1763 2 0 0 0 0 0 0 1010 654 478 217
## 5000-7499 1684 3 0 1 0 0 0 0 736 446 336 169
## 7500-9999 2449 2 0 1 1 0 0 0 858 528 311 169
## 10000-12499 2879 4 3 1 0 0 0 0 1302 707 516 234
## 12500-14999 2376 7 2 0 2 0 0 0 1210 653 509 284
## 15000-19999 3317 2 2 1 0 0 0 0 2181 1179 886 506
## 20000-24999 3586 5 3 1 0 0 0 0 2814 1442 1112 634
## 25000-29999 3257 6 3 2 0 0 0 0 2948 1595 1295 671
## 30000-34999 3115 9 1 2 0 0 0 1 3059 1611 1463 730
## 35000-39999 2575 15 3 1 1 0 0 0 2689 1572 1401 715
## 40000-49999 3656 18 2 2 2 1 0 0 4333 2622 2364 1207
## 50000-59999 2804 13 2 0 1 0 0 0 4155 2641 2705 1262
## 60000-74999 2623 9 4 3 0 0 0 0 4719 3448 3869 1654
## 75000-99999 2396 6 4 3 0 1 1 0 5799 4837 5890 2442
## 100000-149999 1309 7 3 2 0 0 0 0 3754 3379 4685 1831
## 150000+ 723 8 1 0 1 0 0 0 2519 2299 3577 1492
##
## 6 7 8 9
## < 5000 71 49 14 5
## 5000-7499 56 30 7 5
## 7500-9999 75 15 8 2
## 10000-12499 131 42 18 6
## 12500-14999 110 38 17 12
## 15000-19999 185 77 28 12
## 20000-24999 274 88 42 16
## 25000-29999 296 92 41 23
## 30000-34999 296 118 40 18
## 35000-39999 300 106 41 14
## 40000-49999 450 178 61 23
## 50000-59999 509 153 59 27
## 60000-74999 550 186 69 31
## 75000-99999 778 233 85 32
## 100000-149999 578 151 43 26
## 150000+ 441 101 36 13
chisq.test(chart2)
## Warning in chisq.test(chart2): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: chart2
## X-squared = 21089, df = 225, p-value < 2.2e-16
As the p-value (<2.2e-16) is less than the .05 significance level, and the chi-squared value is notably large (21089), we reject the null hypothesis and conclude that the two variables, family income and household size (1-16), are in fact dependent and have a significant relationship.
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.width = 9, fig.height = 5.5,
out.width = "90%", fig.align = "center", cache = T)
The ATUS data is grouped into three separate datasets–atusact, atuscps, and atusresp–that contain information on peoples’ time usage, demographics, and personal information respectively. We will begin by bringing in the atusact dataset and ungrouping it so that we can start manipulating it as we wish.
time_data = ungroup(atusact)
glimpse(time_data)
## Rows: 2,094,140
## Columns: 3
## $ tucaseid <dbl> 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+~
## $ tiercode <int> 10101, 10201, 110101, 120303, 130124, 10101, 10201, 30101, 11~
## $ dur <int> 870, 40, 5, 325, 200, 620, 60, 60, 90, 530, 60, 20, 560, 80, ~
In its original state, atusact contains tiercodes with three, two digit pairs. In this combined format the tiercodes provide very detailed information into the exact activities the individual is performing for whatever length of time; however, what we really want to examine is the overall time various people spend on certain categories of tasks that encompass a broad range of related activities. Therefore, what we need to do is isolate the two digit pairings in the tiercodes to gain a better view of this. As you can see, this simply requires some arithmetic to separate out the various tiers.
time_data2 = time_data %>%
group_by(tiercode) %>%
summarise(tucaseid = tucaseid,
tier1_2Code = tiercode %/% 100,
tier1 = tiercode %/% 10000,
tier2 = (tiercode %% 10000) %/% 100,
tier3 = (tiercode %% 10000) %% 100,
dur = dur) %>%
inner_join(atuscps) %>%
select(tiercode, tier1, dur, region, state, famincome, edu) %>%
drop_na()
## `summarise()` has grouped output by 'tiercode'. You can override using the `.groups` argument.
## Joining, by = "tucaseid"
time_data2
Additionally, as shown above, we want to join the atusact dataset with atuscps so that we can begin analyzing the relationships between peoples’ time usage and various demographics.
First, we will look into the impact of family income on time usage.
time_data3 = time_data2 %>%
group_by(tier1, famincome) %>%
mutate(TDur = sum(dur)) %>%
select(famincome, tier1, TDur) %>%
distinct() %>%
arrange(tier1, TDur)
time_data3
Next, we will examine the impact of region on time usage
time_data4 = time_data2 %>%
group_by(tier1, region) %>%
mutate(RDur = sum(dur)) %>%
select(region, tier1, RDur) %>%
distinct() %>%
arrange(tier1, RDur)
time_data4
And finally, we will look into the impact of educational attainment on time usage
time_data5 = time_data2 %>%
group_by(tier1, edu) %>%
mutate(EDur = sum(dur)) %>%
select(edu, tier1, EDur) %>%
distinct() %>%
arrange(tier1, EDur)
time_data5
For each of our demographic variables (famincome, region, and edu) we will create a stacked bar graph that displays the percentage of the time that each subgroup of the demographic spends on certain types of activities (i.e. Personal Care)
time_data3$tier1 = as.factor(time_data3$tier1)
colourCount = length(unique(time_data3$tier1))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
ggplot(data = time_data3, mapping = aes(x = famincome, y = TDur,
fill = tier1)) +
labs(title = "The Impact of Income on Time Usage Distribution",
subtitle = "Data from the 2003-2016 American Time Usage Survey",
y = "Time Usage (% of Total)", x = "Family Income") +
scale_fill_brewer(palette = "Set1") +
geom_col(position = "fill") +
scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
paste0(x * 100, '%')}) +
theme_bw() +
scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
"Caring for HH Members", "Caring for NonHH Members", "Work",
"Education", "Shopping", "Professional Serv",
"Household Serv", "Gov Serv & Civic Obligations",
"Eating & Drinking", "Socializing & Leisure",
"Exercise & Recreation", "Religious & Spiritual",
"Volunteering", "Telephone Calls", "Traveling"),
values = getPalette(colourCount)) +
coord_flip()
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
Upon cursory inspection, it appears that as family income increases, people tend to spend less time on Personal Care (includes sleeping) and Socializing & Leisure related activities, while spending more time Traveling, on Exercise & Recreation, and on Work related activities.
time_data4$tier1 = as.factor(time_data4$tier1)
ggplot(data = time_data4, mapping = aes(x = region, y = RDur, fill = tier1)) +
labs(title = "The Impact of Region on Time Usage Distribution",
subtitle = "Data from the 2003-2016 American Time Usage Survey",
y = "Time Usage (% of Total)", x = "Region") +
geom_col(position = "fill") +
scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
paste0(x * 100, '%')}) +
theme_bw() +
scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
"Caring for HH Members", "Caring for NonHH Members", "Work",
"Education", "Shopping", "Professional Serv",
"Household Serv", "Gov Serv & Civic Obligations",
"Eating & Drinking", "Socializing & Leisure",
"Exercise & Recreation", "Religious & Spiritual",
"Volunteering", "Telephone Calls", "Traveling"),
values = getPalette(colourCount)) +
coord_flip()
Looking at the resulting bar graph, it appears that region has little impact on time usage as the percentages of time spend on different types of activities across various regions of the United States line up almost perfectly. Although there are some slight deviations, for the most part it seems that region does not influence time usage significantly.
time_data5$tier1 = as.factor(time_data5$tier1)
ggplot(data = time_data5, mapping = aes(x = edu, y = EDur, fill = tier1)) +
labs(title = "The Impact of Educational Attainment on Time Usage Distribution",
subtitle = "Data from the 2003-2016 American Time Usage Survey",
y = "Time Usage (% of Total)", x = "Educational Attainment") +
geom_col(position = "fill") +
scale_y_continuous(breaks = seq(0.0, 1.0, 0.1), labels = function(x){
paste0(x * 100, '%')}) +
theme_bw() +
scale_fill_manual(name = "Category", labels = c("Personal Care", "Household",
"Caring for HH Members", "Caring for NonHH Members", "Work",
"Education", "Shopping", "Professional Serv",
"Household Serv", "Gov Serv & Civic Obligations",
"Eating & Drinking", "Socializing & Leisure",
"Exercise & Recreation", "Religious & Spiritual",
"Volunteering", "Telephone Calls", "Traveling"),
values = getPalette(colourCount)) +
coord_flip()
As one might expect, the trends in the resulting bar graph seem to reflect those that appear in the graph for family income. It is well known that there is a strong correlation between these two variables so this is not that surprising. Most notable among the differences in the visual is that it appears that as educational attainment increases, people spend more time on Work and less time on Socializing & Leisure as well as on Personal Care. We saw these same exact trends in the graph for family income if you recall the result of the earlier code chunk.
Now, we will fit and test various models on the datasets we have created to determine whether these demographics truly have a significant impact on the time usage of people in the ATUS. We will try a number of models and examine the results.
We will start out by conducting some basic tests for linear regression.
mod1 = lm(TDur ~ famincome, data = time_data3)
summary(mod1)
##
## Call:
## lm(formula = TDur ~ famincome, data = time_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1885303 -797445 -378265 -101556 10611074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 357747 413991 0.864 0.38832
## famincome5000-7499 -66123 585472 -0.113 0.91017
## famincome7500-9999 13159 585472 0.022 0.98209
## famincome10000-12499 132774 585472 0.227 0.82077
## famincome12500-14999 80370 585472 0.137 0.89092
## famincome15000-19999 345194 585472 0.590 0.55598
## famincome20000-24999 483598 585472 0.826 0.40957
## famincome25000-29999 501555 585472 0.857 0.39243
## famincome30000-34999 521621 585472 0.891 0.37380
## famincome35000-39999 434728 585472 0.743 0.45845
## famincome40000-49999 895620 585472 1.530 0.12732
## famincome50000-59999 846636 585472 1.446 0.14938
## famincome60000-74999 1085063 585472 1.853 0.06499 .
## famincome75000-99999 1533661 585472 2.620 0.00933 **
## famincome100000-149999 966683 585472 1.651 0.09994 .
## famincome150000+ 583459 585472 0.997 0.31992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1707000 on 256 degrees of freedom
## Multiple R-squared: 0.06503, Adjusted R-squared: 0.01024
## F-statistic: 1.187 on 15 and 256 DF, p-value: 0.2818
Somewhat unexpectedly based on our observations of the bar graph, it appears that family income is not as influential as we thought on peoples’ time usage. While the higher income brackets do show relatively low p-values, the only factor level that produces a p-value less than .05 is ‘famincome75000-99999’.
mod2 = lm(RDur ~ region, data = time_data4)
summary(mod2)
##
## Call:
## lm(formula = RDur ~ region, data = time_data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5086741 -3044299 -2190062 -355643 30292027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2423961 1563676 1.550 0.126
## regionmidwest 1011083 2211372 0.457 0.649
## regionsouth 2683688 2211372 1.214 0.229
## regionwest 691334 2211372 0.313 0.756
##
## Residual standard error: 6447000 on 64 degrees of freedom
## Multiple R-squared: 0.02427, Adjusted R-squared: -0.02146
## F-statistic: 0.5307 on 3 and 64 DF, p-value: 0.6628
As we expected from the graph we generated, region has little impact on time usage. None of the p-values for the regions are below .05.
mod3 = lm(EDur ~ edu, data = time_data5)
summary(mod3)
##
## Call:
## lm(formula = EDur ~ edu, data = time_data5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3625966 -1980998 -640999 5120 21502217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2195607 906423 2.422 0.0168 *
## eduhs diploma 1443162 1281876 1.126 0.2623
## edusome college 306506 1281876 0.239 0.8114
## eduassociate degree -894382 1281876 -0.698 0.4866
## edubachelor's degree 588046 1281876 0.459 0.6472
## edumaster's degree -989201 1281876 -0.772 0.4417
## eduprof degree -1974559 1281876 -1.540 0.1259
## edudoctoral degree -1962477 1281876 -1.531 0.1283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3737000 on 128 degrees of freedom
## Multiple R-squared: 0.09105, Adjusted R-squared: 0.04134
## F-statistic: 1.832 on 7 and 128 DF, p-value: 0.08655
Again, somewhat surprising results based on the appearance of our graph. The p-values for the various educational attainments are even worse than those for family income and none possess a value below .05.
Lets try grouping the dataset observations by tier, famincome, region, and edu to see if we find anything different.
time_data6 = time_data2 %>%
group_by(tier1, famincome, region, edu) %>%
mutate(Dur = sum(dur)) %>%
select(tier1, famincome, region, edu, Dur) %>%
distinct()
time_data6
mod4 = lm(Dur ~ famincome + region + edu, data = time_data6)
summary(mod4)
##
## Call:
## lm(formula = Dur ~ famincome + region + edu, data = time_data6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102071 -31548 -13718 4322 1128166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8427.54 4231.86 1.991 0.046465 *
## famincome5000-7499 -3816.35 4879.66 -0.782 0.434183
## famincome7500-9999 55.08 4818.25 0.011 0.990879
## famincome10000-12499 4193.11 4808.21 0.872 0.383194
## famincome12500-14999 2620.55 4779.92 0.548 0.583541
## famincome15000-19999 11843.57 4760.34 2.488 0.012868 *
## famincome20000-24999 16398.60 4736.64 3.462 0.000539 ***
## famincome25000-29999 16975.94 4734.40 3.586 0.000338 ***
## famincome30000-34999 17600.74 4714.12 3.734 0.000190 ***
## famincome35000-39999 14753.03 4727.50 3.121 0.001811 **
## famincome40000-49999 29370.38 4671.28 6.287 3.40e-10 ***
## famincome50000-59999 27975.38 4675.53 5.983 2.28e-09 ***
## famincome60000-74999 35430.09 4673.16 7.582 3.79e-14 ***
## famincome75000-99999 49180.31 4651.05 10.574 < 2e-16 ***
## famincome100000-149999 31659.31 4659.33 6.795 1.16e-11 ***
## famincome150000+ 19701.79 4659.55 4.228 2.38e-05 ***
## regionmidwest 8514.51 2351.50 3.621 0.000295 ***
## regionsouth 22214.56 2337.50 9.504 < 2e-16 ***
## regionwest 5875.65 2346.52 2.504 0.012300 *
## eduhs diploma 22456.89 3200.04 7.018 2.44e-12 ***
## edusome college 4728.70 3204.49 1.476 0.140078
## eduassociate degree -14254.41 3212.78 -4.437 9.25e-06 ***
## edubachelor's degree 9207.77 3207.50 2.871 0.004106 **
## edumaster's degree -15479.89 3244.43 -4.771 1.86e-06 ***
## eduprof degree -32806.64 3439.43 -9.538 < 2e-16 ***
## edudoctoral degree -32875.96 3429.05 -9.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74240 on 8026 degrees of freedom
## Multiple R-squared: 0.09374, Adjusted R-squared: 0.09092
## F-statistic: 33.21 on 25 and 8026 DF, p-value: < 2.2e-16
The results of this tests are still rather confusing as the p-values don’t line up with the results from earlier tests. The results show both educational attainment and region as having significant impact on time usage despite the results of the earlier tests. Based on this test the most significant predictors are certain levels of each factor. These levels include edudoctoral degree, eduprof degree, regionsouth, and famincome 75000-99999.
Overall, based on the results of the tests and the graphics generated, I think that it is difficult to reach a decisive decision on the impact of the demographic variables Family Income, Region, and Educational Attainment on Time Usage. However, I would say that it seems these three variables do not have that significant of an impact on Time Usage. If I were to rank the impact of the variables it would likely go Family Income, Educational Attainment, and then finally Region.
# Find average family income by educational attainment
avg_income_by_edu <- atuscps
avg_income_by_edu <- avg_income_by_edu %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(education=edu) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
avg_income_by_edu
#Visualize results
ggplot(data=avg_income_by_edu, aes(x=education, y=avg_income)) +
geom_bar(stat='identity', fill = brewer.pal(n=8,name = "Set1")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Investigating income by education")
Here we find the expected result of an apparent trend in income as educational attainment increases.
# Investigate by region
# Find average family income by educational attainment
avg_income_by_edu_region <- atuscps
avg_income_by_edu_region <-avg_income_by_edu_region %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(region,edu) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
avg_income_by_edu_region %>% head(10)
# Compare regions by income
region_df <- avg_income_by_edu_region %>%
group_by(region) %>%
summarise(N=n(),
avg_income = mean(avg_income)) %>%
arrange(avg_income)
ggplot(region_df, aes(x=region, y=avg_income, label=avg_income)) +
geom_bar(stat='identity', fill = brewer.pal(n=4,name = "BuPu")) +
geom_text(aes(label = signif(avg_income, digits = 3)), nudge_y = -2000, color='DarkBlue') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Investigating income by region")
Here the results are less clear, although it would appear that the south has relatively low average income.
# Compare income by region and edu
ggplot(avg_income_by_edu_region, aes(x=region, y=avg_income)) +
geom_col(position='stack') +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -10000, color = 'white') +
facet_grid(~ edu) +
scale_fill_brewer(palette="Set1") +
ggtitle("Investigating income by education and region")
Here we see specific education levels may play a mitigating role in the observed disparity by region.
avg_income_by_race <- atuscps
avg_income_by_race <-avg_income_by_race %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race,region) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race %>% head(10)
ggplot(avg_income_by_race, aes(x=region, y=avg_income)) +
geom_bar(position='stack', stat='identity') +
scale_fill_manual(palette = "Set1") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -5500, color = 'white') +
facet_grid(~ race) +
ggtitle("Investigating income by race and region")
The results here show a more nuanced picture. The effects of region on income are mitigated by race differently.
avg_income_by_race_edu <- atuscps
avg_income_by_race_edu <-avg_income_by_race_edu %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race,edu) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race_edu
ggplot(avg_income_by_race_edu, aes(x=edu, y=avg_income)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_fill_brewer(palette="Set1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -10000, color = 'white') +
facet_grid(~ race) +
ggtitle("Investigating income by race and education")
The generally linear appearance of the graph is consistent with the above findings excluding the “Other” categor which has an unusal
# Put to the test
atuscps_mid_income <- atuscps
atuscps_mid_income <- atuscps_mid_income %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2)
test_stats <- lm(fam_income_mid ~ edu + region + race, data=atuscps_mid_income)
summary(test_stats)
##
## Call:
## lm(formula = fam_income_mid ~ edu + region + race, data = atuscps_mid_income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80620 -23129 -6326 16812 99352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41965.8 275.7 152.241 < 2e-16 ***
## eduhs diploma 5307.8 255.6 20.770 < 2e-16 ***
## edusome college 13320.0 276.7 48.144 < 2e-16 ***
## eduassociate degree 17348.4 329.5 52.657 < 2e-16 ***
## edubachelor's degree 31020.2 275.1 112.758 < 2e-16 ***
## edumaster's degree 37903.4 353.5 107.211 < 2e-16 ***
## eduprof degree 41615.5 799.1 52.078 < 2e-16 ***
## edudoctoral degree 42897.2 733.4 58.492 < 2e-16 ***
## regionmidwest -2299.0 254.6 -9.031 < 2e-16 ***
## regionsouth -3959.9 238.4 -16.610 < 2e-16 ***
## regionwest -2048.5 262.5 -7.803 6.08e-15 ***
## raceBlack only -12358.8 240.7 -51.341 < 2e-16 ***
## raceOther -5780.7 536.0 -10.784 < 2e-16 ***
## raceAsian only 4506.8 476.8 9.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31370 on 152028 degrees of freedom
## Multiple R-squared: 0.1716, Adjusted R-squared: 0.1715
## F-statistic: 2422 on 13 and 152028 DF, p-value: < 2.2e-16
Here we can see the initial hypothesis was confirmed ane race, education, and region are all significant predictors of family income.
##Background Visualizations
The following bar graph represents the numbers of people of each race in the U.S. population. This provides a frame of reference for the U.S. population composition when viewing later data visualizations.
National Visualization
race_plot <- ggplot(data = atuscps, aes(x = race, fill = race)) +
geom_bar() +
scale_fill_brewer(palette = "Set1") +
labs(title = "Racial Composition of the U.S.", x = NULL, y = NULL, fill = "Race") +
labs(title = "Regional Racial Compositions in the U.S.", x = "Race", y = "Count") +
theme(
plot.title = element_text(hjust = 0.5)
)
race_plot
Regional Visualization The regional racial distributions seem to be similar to the national one. From this visualization, it can be concluded that the most populated region of the U.S. is the South and the least is the Northeast.
race_plot2 <- ggplot(data = atuscps, aes(x = race, fill = race)) +
geom_bar() +
scale_fill_brewer(palette = "Set1") +
labs(title = "Racial Composition of the U.S.", x = NULL, y = NULL, fill = "Race") +
labs(title = "Regional Racial Compositions in the U.S.", x = "Race", y = "Count") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ region)
race_plot2
Other demographic variables, such as citizenship status and country of origin, are relevant to race and may act as subcategories of race.
National Visualization Surprisingly, there seems to be a greater percentage of black people who hold citizenship than that of white people. This may be due to greater immigration from European countries. It is clear that Asians are most likely to not hold citizenship status among the racial categories studided.
citizenship_plot <- ggplot(data = atuscps, mapping = aes(x = race, fill = citizen)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(title = "U.S. Citizenship Status Across Races", x = "Race", y = "Count", fill = "U.S. Citizen") +
theme(
plot.title = element_text(hjust = 0.5)
)
citizenship_plot
Regional Visualization Aside from Asians, every other racial group consistently has at least 87.5% citizenship in each region. In the west, a greater proportion of Asians are citizens compared to those of the three other regions.
citizenship_plot2 <- ggplot(data = atuscps, mapping = aes(x = race, fill = citizen)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(title = "U.S. Citizenship Status Across Races by Region", x = "Race", y = "Count", fill = "U.S. Citizen") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ region)
citizenship_plot2
In the atuscps dataset, the variable listing each respondent’s country of origin is binary; repsondents were asked whether or not their country of origin is the U.S. Both the national and regional visualizations for this variable seem to correspond strongly with the U.S. citizenship status visualizations in the previous section.
National Visualization
country_plot <- ggplot(data = atuscps, mapping = aes(x = race, fill = country_born)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Countries of Origin Across Races", x = "Race", y = "Count", fill = "Country of Origin") +
theme(
plot.title = element_text(hjust = 0.5)
)
country_plot
Regional Visualization
country_plot2 <- ggplot(data = atuscps, mapping = aes(x = race, fill = country_born)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Countries of Origin Across Races by Region", x = "Race", y = "Count", fill = "Country of Origin") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ region)
country_plot2
To visualize the possible relationship between race and income, three representations of income were used : family income brackets, average family income, and the ratio of the top 1% to the bottom 99%.
Observations with missing values have been removed from this dataset to better visualize the income graphs
income <- atuscps %>%
na.omit()
income
Race As expected, whites and Asians are increasingly larger proportions of income brackets as they ascend. Conversely, blacks and other racial groups are decreasingly smaller proportions of income brackets as they ascend. This income disparity could be attributed to economic and social inequality.
race_income <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Racial Breakdown of Each Income Bracket in the U.S.", x = "Family Income Bracket", y = "Count", fill = "Race") +
theme(
plot.title = element_text(hjust = 0.5)
)
race_income
Regional Differences in Income To determine whether the regional income distributions differed from those of the national visualization, the facet_wrap() function was used. Again, the regional trends are similar to those of the national one.
race_income2 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Region", x = "Family Income Bracket", y = "Count", fill = "Race") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ region)
race_income2
U.S. Citizenship While the proportion of non-citizen blacks in each income bracket remains fairly consistent, the proportion of blacks with U.S. citizenship decreases as the income brackets increase. For whites, the proportion of people decreases for non-citizens and increases for citizens as the income brackets increase. The other racial groups seem to maintain the same distributions regardless of citizenship status. Although this requires further statistical analysis, these results may indicate that citizenship is a significant variable in predicting income.
race_income3 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Citizenship", x = "Family Income Bracket", y = "Count", fill = "Race") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ citizen)
race_income3
Country of Origin The proportion of whites born outside of the U.S. decreases as the income bracket increases, which corresponds to the trend seen for non-citizen whites in the citizenship visualization. The proportion of blacks born outside of the U.S. remains fairly consistent across each income bracket, which again reflects the citizenship distribution. Distributions for the two other racial groups are similar to those of the national ones.
race_income4 <- ggplot(data = income, mapping = aes(x = famincome, fill = race)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Racial Breakdown of Each Income Bracket in the U.S. by Country of Origin", x = "Family Income Bracket", y = "Count", fill = "Race") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ country_born)
race_income4
The dataset below calculates the average family income for each race within each region.
avg_income_by_race <- atuscps
avg_income_by_race <-avg_income_by_race %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
avg_income_by_race
Race Surprisingly, Asians have the highest average family income, followed by whites, other, then blacks.
avg_income_plot <- ggplot(data = avg_income_by_race, mapping = aes(x = race, y = avg_income)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Average U.S. Family Income by Race", x = "Race", y = "Average Family Income") +
theme(
plot.title = element_text(hjust = 0.5)
)
avg_income_plot
Regional Differences in Income The average family income maintains the same distribution for each region as the general racial one in the previous visualization.
avg_income_by_race2 <- atuscps %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race, region) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race2
avg_income_plot2 <- ggplot(data = avg_income_by_race2, mapping = aes(x = race, y = avg_income)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Regional Average U.S. Family Income by Race", x = "Race", y = "Average Family Income") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ region) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -15000, color = 'white')
avg_income_plot2
U.S. Citizenship For U.S. citizens, the distribution was the same as the ones for the previous two visualizations. Surprisingly, average family income was about $40,000 for all non-citizens except for Asians. Asian non-citizens had a considerably higher average family income than these other racial groups.
avg_income_by_race3 <- atuscps %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race, citizen) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race3
avg_income_plot3 <- ggplot(data = avg_income_by_race3, mapping = aes(x = race, y = avg_income)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Average U.S. Family Income by Citizenship Status Across Race", x = "Race", y = "Average Family Income") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ citizen) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -15000, color = 'white')
avg_income_plot3
Country of Origin The country of origin trend mirrors that of the one for citizenship. This makes sense, as respondents who reported that they were born outside of the U.S. most likely do not hold U.S. citizenship.
avg_income_by_race4 <- atuscps %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2) %>%
group_by(race, country_born) %>%
summarise(N = n(),
avg_income = mean(fam_income_mid)) %>%
arrange(avg_income)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
avg_income_by_race4
avg_income_plot4 <- ggplot(data = avg_income_by_race4, mapping = aes(x = race, y = avg_income)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Average U.S. Family Income by Citizenship Status Across Race", x = "Race", y = "Average Family Income") +
theme(
plot.title = element_text(hjust = 0.5)
) +
facet_wrap(~ country_born) +
geom_text(aes(label = signif(avg_income, digits = 3)), angle = 90,
nudge_y = -15000, color = 'white')
avg_income_plot4
First, simple linear regresions were conducted to analyze the individual demographic varaibles from the study. Then, they were combined to form a multiple regression to create a more complex and realistic relationship to income.
income2 <- atuscps %>%
separate(famincome, into=c('income_low','income_high'), sep='-',convert=TRUE)%>%
mutate(income_low = as.integer(income_low)) %>%
drop_na() %>%
mutate(fam_income_mid = (income_high+income_low)/2)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15474 rows [205,
## 211, 248, 292, 311, 377, 400, 431, 504, 535, 554, 555, 581, 596, 611, 616, 655,
## 662, 710, 766, ...].
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
income2
mod1 <- lm(fam_income_mid ~ race, data = income2)
summary(mod1)
##
## Call:
## lm(formula = fam_income_mid ~ race, data = income2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61183 -27646 -10146 27564 85064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55145.11 96.74 570.06 <2e-16 ***
## raceBlack only -15210.02 255.65 -59.50 <2e-16 ***
## raceOther -8671.35 578.68 -14.98 <2e-16 ***
## raceAsian only 12286.94 511.35 24.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33970 on 152038 degrees of freedom
## Multiple R-squared: 0.02868, Adjusted R-squared: 0.02866
## F-statistic: 1496 on 3 and 152038 DF, p-value: < 2.2e-16
mod2 <- lm(fam_income_mid ~ citizen, data = income2)
summary(mod2)
##
## Call:
## lm(formula = fam_income_mid ~ citizen, data = income2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47881 -26631 -9131 33369 82514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42485.2 320.5 132.56 <2e-16 ***
## citizenyes 11644.9 333.3 34.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34330 on 152040 degrees of freedom
## Multiple R-squared: 0.007964, Adjusted R-squared: 0.007957
## F-statistic: 1221 on 1 and 152040 DF, p-value: < 2.2e-16
mod3 <- lm(fam_income_mid ~ country_born, data = income2)
summary(mod3)
##
## Call:
## lm(formula = fam_income_mid ~ country_born, data = income2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47988 -26738 -9238 33262 77475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54237.79 95.48 568.06 <2e-16 ***
## country_bornnon-US -6712.89 249.07 -26.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34390 on 152040 degrees of freedom
## Multiple R-squared: 0.004755, Adjusted R-squared: 0.004749
## F-statistic: 726.4 on 1 and 152040 DF, p-value: < 2.2e-16
To represent the more complex relationships between all of these demographic variables and income, I used multiple regression
mod4 <- lm(fam_income_mid ~ race + citizen + country_born, data = income2)
summary(mod4)
##
## Call:
## lm(formula = fam_income_mid ~ race + citizen + country_born,
## data = income2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69055 -27193 -8443 26557 99441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46051.0 462.2 99.63 <2e-16 ***
## raceBlack only -15564.4 254.0 -61.28 <2e-16 ***
## raceOther -8503.4 574.7 -14.80 <2e-16 ***
## raceAsian only 18797.7 535.9 35.08 <2e-16 ***
## citizenyes 10455.4 451.8 23.14 <2e-16 ***
## country_bornnon-US -4927.7 348.4 -14.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33730 on 152036 degrees of freedom
## Multiple R-squared: 0.04217, Adjusted R-squared: 0.04214
## F-statistic: 1339 on 5 and 152036 DF, p-value: < 2.2e-16
Based on the results of the linear regressions conducted, all of the demographic variables studied are significant due to their small p-values (<2e-16). Unfortunately, this means that the impacts and importance of each variable cannot be distinguished since they returned the same p-values and are all concluded to be significant. In conclusion, race, citizenship, and country of origin are all significant predictors of family income.